Are We in Kansas Anymore?

data viz
analysis
Author

Andrew Carr

Published

November 1, 2019

In this post, I examine how Hollywood film has changed over the past few decades. I look at the changing relationship between genre and movie box office returns, shifts in the representation of men and women among top-billed actors, and the relationship between critical and commercial success. I conduct these analyses using data that I collected through Wikipedia’s APIs. The data consists of 9712 movies. The population frame is all movies with Wikipedia entries released in the United States between 1980 and 2019.

Data

Wikipedia has a set of APIs that allows users to collect almost anything from the site. My data comes from a group of pages that have the headline “List of American films of [a year]”. Each of these pages has tables with movie titles and links to their pages. By drawing from these, I collected a list of names and links for 9712 movies and pulled information from the infobox of each movie page. Here’s what the infobox looks like for Next, a timeless cinematic masterpiece starring Nicolas Cage as a small-time magician who can see exactly two minutes into the future.
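As a rough illustration of the collection step, the sketch below builds a request URL for one of the yearly list pages using the MediaWiki Action API. The post does not say which endpoint or parameters were used, so `action=parse` with `prop=links` is an assumption, and `build_list_url` is a hypothetical helper name.

```r
# Build a MediaWiki Action API URL that asks for the links on one yearly list page.
# Assumption: action=parse with prop=links is one documented way to do this;
# it is not necessarily the request used to build the dataset in this post.
build_list_url <- function(year) {
  page <- paste0("List_of_American_films_of_", year)
  params <- c(action = "parse", page = page, prop = "links", format = "json")
  # Percent-encode each value, then join the key=value pairs with "&"
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  query <- paste0(names(params), "=", encoded, collapse = "&")
  paste0("https://en.wikipedia.org/w/api.php?", query)
}

url_2007 <- build_list_url(2007)
print(url_2007)

# Fetching is left commented out because it needs network access and the
# jsonlite package; the shape of the parsed response should be inspected
# before relying on it.
# response <- jsonlite::fromJSON(url_2007)
```

Looping such a helper over 1980 to 2019 would give one request per yearly list page.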

For each movie, I collected the release date, box office, budget, runtime, directors, and top-billed actors from the infobox. I also gathered links to the pages of top-billed actors in each movie. I collected additional information by examining the main body of movie pages. Most movie pages have a “Critical Reception” section that has a movie’s Rotten Tomatoes score and the number of reviews on which this score is based. I also extracted movie genre from the introduction of each movie page. Finally, I used a set of rules for extracting where the film was set from the film synopsis. Let’s have a look at the columns of the data.

colnames(movie_metadata_tbl)
 [1] "name"            "name_lab"        "director"        "director_link"  
 [5] "genre_cat"       "runtime"         "budget"          "budget_adj"     
 [9] "box_office"      "box_office_adj"  "profit_adj"      "profit_lab"     
[13] "review"          "num_review"      "date"            "year"           
[17] "month"           "day"             "year_fin"        "cast"           
[21] "cast_link"       "cast_race"       "cast_gender"     "cast_age"       
[25] "cast_age_gender" "cast_bday"       "tot_white"       "tot_black"      
[29] "tot_hisp"        "tot_asian"       "white_prop"      "black_prop"     
[33] "hisp_prop"       "asian_prop"      "race_tots"       "tot_man"        
[37] "tot_woman"      
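Columns such as budget and box_office presumably start out as free-text infobox values. The sketch below shows one way such strings could be turned into numbers. `parse_money` is a hypothetical helper, and the input formats it handles are assumptions about what infobox text looks like, not the cleaning code actually used for this post.

```r
# Convert strings like "$76.1 million" or "$1,234,567" to a number of dollars.
# Returns NA when no dollar amount is found.
parse_money <- function(x) {
  x <- tolower(gsub(",", "", x))                 # drop thousands separators
  has_amount <- grepl("\\$\\s*[0-9]", x)
  value <- rep(NA_real_, length(x))
  # Keep the first number that follows a dollar sign
  value[has_amount] <- as.numeric(sub(
    "^.*?\\$\\s*([0-9]+(?:\\.[0-9]+)?).*$", "\\1", x[has_amount], perl = TRUE
  ))
  # Scale by the unit word, if any
  multiplier <- ifelse(grepl("billion", x), 1e9,
                ifelse(grepl("million", x), 1e6, 1))
  value * multiplier
}

# Made-up example strings, for illustration only
parsed <- parse_money(c("$76.1 million", "$2.5 billion", "$1,234,567", "unknown"))
print(parsed)
```

For a range such as "$70–80 million", this version keeps only the first figure, which may or may not be the right choice for a given analysis.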

This dataset has movie name, director and director link, genre, runtime, budget and box office information, Rotten Tomatoes review information, and release date information. After that, there is a set of columns that are nested lists containing data on top-billed actors in each movie. These lists contain actors’ names, links to their Wikipedia pages, race, gender, age, birthday, and more. Finally, there are several columns of movie-level actor data, including the proportion of top-billed actors who are Black and the total number of women among top-billed actors. Let’s start with some exploratory data analysis. Here are the top ten highest-grossing Hollywood movies according to the data.

movie_metadata_tbl %>%
  arrange(desc(box_office)) %>% 
  slice(1:10) %>% 
  pull(name_lab)
 [1] "Avengers: Endgame"            "Avatar"                      
 [3] "Titanic"                      "Star Wars: The Force Awakens"
 [5] "Avengers: Infinity War"       "Jurassic World"              
 [7] "The Lion King"                "The Avengers"                
 [9] "Furious 7"                    "Avengers: Age of Ultron"     

Let’s see how this list compares to an inflation-adjusted list of highest-grossing films.

movie_metadata_tbl %>%
  arrange(desc(box_office_adj)) %>% 
  slice(1:10) %>%
  pull(name_lab)
 [1] "Titanic"                      "Avatar"                      
 [3] "Avengers: Endgame"            "Star Wars: The Force Awakens"
 [5] "E.T. the Extra-Terrestrial"   "Avengers: Infinity War"      
 [7] "Jurassic Park"                "Jurassic World"              
 [9] "The Avengers"                 "The Empire Strikes Back"     
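The post does not describe how box_office_adj was computed. A common approach is to rescale each nominal amount by the ratio of a price index in a base year to the index in the release year. The sketch below illustrates that idea only: `adjust_for_inflation` is a hypothetical helper, and the index values are invented round numbers, not real CPI figures.

```r
# Invented index values for illustration only (NOT real CPI data)
toy_index <- c("1990" = 50, "2005" = 80, "2019" = 100)

# Express nominal amounts from `year` in `base_year` dollars
adjust_for_inflation <- function(amount, year, index, base_year) {
  base_level <- index[[as.character(base_year)]]
  year_level <- unname(index[as.character(year)])
  amount * base_level / year_level
}

# A 1990 gross of 100 million and a 2019 gross of 200 million, in 2019 dollars
adjusted <- adjust_for_inflation(c(100e6, 200e6), c(1990, 2019), toy_index, 2019)
print(adjusted)
```

With these toy numbers the two films end up tied, which is the kind of reordering that pushes older releases up the adjusted list.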

Adjusting for inflation vaults James Cameron to the top of the list with Titanic and Avatar. Next, I pull the longest and shortest movies from the data.

paste("Longest: ", movie_metadata_tbl %>% 
        arrange(desc(runtime)) %>% pull(name_lab) %>% .[1])
[1] "Longest:  The Cure for Insomnia"
paste("Shortest: ", movie_metadata_tbl %>% 
        arrange(runtime) %>% pull(name_lab) %>% .[1])
[1] "Shortest:  Luxo Jr."

The Cure for Insomnia is an 87-hour-long experimental film that consists of an artist reading a 4,080-page poem. It held the Guinness record for longest film before being supplanted by a non-American movie. Luxo Jr. is a two-minute-long animated film released by Pixar in 1986 that was the first CGI movie to be nominated for an Oscar. We can also look at which actors appear most in the data.

movie_metadata_tbl$cast_link %>% 
  unlist %>% 
  table %>%
  sort(decreasing = TRUE) %>% 
  head(5)
.
 /wiki/Samuel_L._Jackson       /wiki/Bruce_Willis       /wiki/Nicolas_Cage 
                      76                       67                       65 
    /wiki/Robert_De_Niro /wiki/Christopher_Walken 
                      65                       62 

It turns out that Samuel L. Jackson is the hardest working actor in show business, with 76 top billings since 1980. Jackson has this distinction on lock, holding a nine-film lead on Unbreakable co-star Bruce Willis.
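The cast columns are nested lists, which is why the count above flattens cast_link with unlist before tabulating. The sketch below shows, on toy data, how movie-level totals such as tot_man and tot_woman could be derived from a list-column like cast_gender. The titles and the "man"/"woman" labels are assumptions for illustration, not values taken from the real dataset.

```r
# Toy data with a list-column: one character vector of gender labels per movie
toy_movies <- data.frame(name = c("Movie A", "Movie B", "Movie C"),
                         stringsAsFactors = FALSE)
toy_movies$cast_gender <- list(
  c("man", "woman", "man"),
  c("woman", "woman"),
  character(0)              # a movie with no cast information
)

# Count each label within every movie's cast vector
toy_movies$tot_woman <- vapply(toy_movies$cast_gender,
                               function(g) sum(g == "woman"), integer(1))
toy_movies$tot_man <- vapply(toy_movies$cast_gender,
                             function(g) sum(g == "man"), integer(1))

print(toy_movies[, c("name", "tot_man", "tot_woman")])
```

Using vapply rather than sapply keeps the result an integer vector even when a cast list is empty.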

What other amusing outliers can we find in the data? How about worst movie of all time? I get this by filtering the data to movies that have received more than 40 Rotten Tomatoes reviews and sorting by Rotten Tomatoes score, lowest first.

movie_metadata_tbl %>% 
  filter(num_review > 40) %>% 
  arrange(review) %>%
  pull(name) %>% 
  head(10)
 [1] "Pinocchio_(2002_film)"             "National_Lampoon%27s_Gold_Diggers"
 [3] "One_Missed_Call_(2008_film)"       "A_Thousand_Words_(film)"          
 [5] "Gotti_(2018_film)"                 "The_Master_of_Disguise"           
 [7] "Twisted_(2004_film)"               "Alone_in_the_Dark_(2005_film)"    
 [9] "Daddy_Day_Camp"                    "Disaster_Movie"                   

These movies all received either a 0% or 1% on Rotten Tomatoes based on more than 40 reviews. There are some derivative horror movies (One Missed Call, Alone in the Dark) and tasteless comedies (Disaster Movie, National Lampoon’s Gold Diggers) here. We see movies that have ended careers (Roberto Benigni as Pinocchio in Pinocchio, Cuba Gooding Jr. in Daddy Day Camp). My favorite on this list is Dana Carvey’s incredibly misguided attempt to capitalize on the success of Mike Myers’s Austin Powers with The Master of Disguise.

Recap

That concludes our journey through forty years of Hollywood film. I hope you learned a thing or two. Please reach out to me if you have any questions about how I created these plots or the underlying data.